People Column | Interview with Dr. Giuseppe Samo
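Editor's Note
In the “People Column” of 《理论语言学五道口站》 (2021, No. 4; issue 138 overall), we share an interview with Dr. Giuseppe Samo conducted by our staff writer Ping Wang. Dr. Samo is an associate professor in the Department of Linguistics at Beijing Language and Culture University; his work centers on the interfaces of linguistics with artificial intelligence, statistics, experimental methods, and language acquisition. In this interview, Dr. Samo shares his views on how programming languages affect linguistic experiments and on the relationship between syntactic research and artificial intelligence, offers advice on how to evaluate quantitative methods and how to approach learning artificial intelligence, and finally anticipates the challenges that artificial intelligence will bring to interdisciplinary research.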
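Profile
Dr. Giuseppe Samo
Dr. Giuseppe Samo is an associate professor in the Department of Linguistics at Beijing Language and Culture University, and the first foreign teacher formally appointed to a regular staff position there. His linguistics courses focus on the interfaces of linguistic theory with artificial intelligence, statistics, experimental methods, language acquisition, and diachronic studies. He received his doctorate from the University of Geneva, Switzerland, where he used syntactic cartography to build a formal model that accounts for syntactic micro-variation among V2 languages. His research interests include the interfaces of syntactic theory with pragmatics, computational linguistics, and data science. Dr. Samo also works to disseminate linguistic theory widely, so that it becomes familiar to a broader public.
Selected works: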
Samo, G. (2019) A Criterial Approach to the Cartography of V2, John Benjamins Publishing, ISBN 9789027204486.
Samo, G., Merlo, P. (2019) Intervention effects in object relatives in English and Italian: a study in quantitative computational syntax, Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), Association for Computational Linguistics, 46–56.
Brief Introduction
Dr. Giuseppe Samo is an associate professor and researcher in the Department of Linguistics at Beijing Language and Culture University, where he teaches linguistics courses focusing on the role of linguistic theory at its interfaces with artificial intelligence, statistical and experimental methods, language acquisition, and diachronic studies. He received his doctoral degree from the University of Geneva (Switzerland), where he used cartographic analytical tools to develop a formal model accounting for syntactic micro-variation among V2 languages. His interests include the role of syntactic theory at the interfaces with pragmatics, computational linguistics, and data science. He also carries out dissemination activities that bring linguistic theory to a more general public.
Selected publications:
Samo, G. (2019) A Criterial Approach to the Cartography of V2, John Benjamins Publishing, ISBN 9789027204486.
Samo, G., Merlo, P. (2019) Intervention effects in object relatives in English and Italian: a study in quantitative computational syntax, Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), Association for Computational Linguistics, 46–56.
Interview
01.
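Ping Wang: There are many programming languages today, such as C++, Java, and Python. Do you think the use of these languages affects linguistic research? Have you ever used a programming language to carry out a linguistic experiment, and could you give an example?
Dr. Giuseppe Samo: Being able to use a programming language is undoubtedly an advantage, and the skill can be combined with technical linguistic knowledge. In my view, for example, mastering Python helps in processing, analysing, and extracting data and, with the relevant resources, in visualizing large amounts of data. Fortunately, many computer scientists and computational linguists have been working to make existing user-friendly tools even easier to use, so that theoretical linguists with no programming knowledge can use them. Whenever possible, I personally prefer this option: using these existing tools to process data is the best choice in terms of both time and quality. Moreover, this choice is a way for the two disciplines to interact productively and to keep pace with new technologies and methods.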
02.
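Ping Wang: Computational linguistics and natural language processing (NLP) are not newly emerged fields; as early as 1946, people had begun trying to use computers to process natural language. In NLP research, how computers process and understand human language, complex sentences in particular, has received considerable attention. How, in your view, can the study of syntactic structure advance NLP research?
Dr. Giuseppe Samo: The formal study of syntax has played an important role throughout the history of NLP. Data-driven, purely statistical methods have of course performed impressively, yet there are problems they cannot solve. In the recent NLP literature, linguistic theory is regaining its place in developing formal mechanisms that describe certain operations explicitly. For example, a very common practice in many AI systems is the so-called conversion of words into vectors, which are derived, roughly speaking, from counts of how often words co-occur with other words. Linguistic theory shows that not all words have the same nature: adjectives, nouns, subjects, objects, and so on correspond to different “mathematical” objects, such as vectors or matrices (see Baroni M. & Zamparelli R., 2010, Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 1183–1193). Research on complex structures and on humans still concentrates on evaluating the accuracy of performance: the study by Gulordava and collaborators (Gulordava et al. 2018. Colorless Green Recurrent Networks Dream Hierarchically. In Proceedings of NAACL, 1195–1205), for example, set up human control groups and checked for asymmetries against the results of the neural networks.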
03.
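Ping Wang: Corpus-based language research usually relies on quantitative methods, that is, on analysing how frequently an entity occurs in a large body of data. In your view, how should we assess the strengths and weaknesses of quantitative methods in language research?
Dr. Giuseppe Samo: Ever since Chomsky's theories came under discussion in the 1960s, and since his famous sentence “Colorless green ideas sleep furiously”, the question of quantification has been the subject of a long-standing debate. On this issue I personally side with Prof. Merlo (Merlo P. 2016. Quantitative computational syntax: some initial results, Italian Journal of Computational Linguistics, 2016, 2(1), 11–30), who holds that frequency is the dependent variable in the study of grammar. As I see it, frequency cannot serve as a factor in language research (for that view, see Ibbotson P. 2013. The Scope of Usage-Based Theory, Frontiers in Psychology, for an overview), but it is certainly not a completely irrelevant variable either. In a recent paper written with Prof. Paola Merlo (Samo, G., Merlo, P., 2019, Intervention effects in object relatives in English and Italian: a study in quantitative computational syntax, Quasy, 46–56), we argue that frequency is closely tied to structural complexity: structures that are complex to compute tend to occur less often. If we work with corpora, however, we still face the problem of counterfactual data. In other words, quantitative methods enrich the qualitative dimension, and the two kinds of analysis are not entirely opposed.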
04.
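Ping Wang: You once said that syntactic cartography has a major impact on the development of artificial intelligence. Conversely, can artificial intelligence advance linguistics? Could you give an example?
Dr. Giuseppe Samo: First, bear in mind that artificial intelligence is roughly equivalent to statistical computation. Your question touches on a very important topic, especially with respect to how linguistic theory can make sense of what happens inside a neural network, which today is still a black box. A recent article by Prof. Tal Linzen and Prof. Marco Baroni (2020, Syntactic Structure from Deep Learning, in press at Annual Review of Linguistics, downloadable at: https://arxiv.org/abs/2004.10827) reviews the existing literature and discusses some ways of interpreting these results from a theoretical point of view. What we need to do is map the machine's results onto our formal representations and investigate them. On the other hand, from a practical standpoint, the more precise data obtained by adding syntactic information through linguistic annotation can help machines improve on many language tasks, from machine translation to information retrieval, and artificial intelligence plays an important role especially for under-resourced languages.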
05.
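Ping Wang: Linguistics and computer science have long been seen as unrelated majors. If linguistics students want to study artificial intelligence linguistics or computational linguistics, do you have any constructive advice for them?
Dr. Giuseppe Samo: Language is the means by which every science, and therefore every major, is expressed. Linguistic knowledge can certainly help in other fields. Conversely, the analytical perspectives offered by other objects of study can only be an advantage for a linguist. From philosophy to mathematics, from literature to psychology, advanced work in any discipline is helpful to linguistic research. Some ability to understand mathematics, for example, helps in linguistic research, and training in computer science improves our ability to analyse or extract data.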
06.
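Ping Wang: Artificial intelligence research is becoming more and more popular. Given the global trend in artificial intelligence, what new challenges do you think it will bring to interdisciplinary research?
Dr. Giuseppe Samo: Interaction with other fields is the basis of every innovation, as in the exchange between linguistics and cognitive science. Today a large part of the humanities is going digital: linguistics was one of the earliest disciplines to take part in this transformation and can be regarded as a leader of the digital humanities as a whole. Artificial intelligence gives linguists different perspectives for testing the learnability of proposed models, and the results should be compared with classical behavioural studies and brain studies. I do not know what linguistics will look like in the future, but our methods and technologies must undoubtedly develop with the times.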
English Version
01.
Ping Wang: There are many programming languages, such as C++, Java, and Python. Do you think the use of programming languages can make a difference in linguistic research? Have you ever used programming methods in your own research? Could you give an example?
Dr. Giuseppe Samo: The ability to program is definitely an advantage, a skill that can be integrated with technical linguistic knowledge. I think, for example, that mastering Python helps in processing, analysing, and extracting information from large amounts of data and, with the relevant libraries, in visualizing them. Luckily, many computer scientists and computational linguists have made, and continue to make, user-friendly tools available that a theoretical linguist can use even without the relevant programming knowledge. If possible, I personally always prefer this option: using these existing tools can be considered the best choice in terms of time and quality of data processing. Moreover, it is a good way for the two communities to interact and a good way to stay up to date on new technologies and methods.
02.
Ping Wang: Computational linguistics, or natural language processing (NLP), is not a new field. As early as 1946, attempts were made to use computers to process natural language. In NLP research, much attention has been paid to how computers process and understand human language, especially complex sentences. So how does the analysis of syntactic structure contribute to the further improvement of NLP?
Dr. Giuseppe Samo: Throughout the history of NLP, the formal study of syntax has played an important role. Naturally, data-driven, purely statistical methods have shown great performance, yet they leave a gap behind. In the recent NLP literature, however, linguistic theory is regaining its position by developing formal mechanisms that clearly describe certain operations. For example, an extremely common practice in many AI systems is the so-called transformation of words into vectors, which result, roughly speaking, from counts of how frequently words co-occur with other words. By implementing linguistic theory, we can show that not every word has the same nature: adjectives, nouns, subjects, objects, and so on represent different “mathematical” objects, such as vectors or matrices (see, for example, Baroni M. & Zamparelli R., 2010, Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 1183–1193). Complex structures and human performance still remain the benchmark against which accuracy is evaluated: for example, in the work by Gulordava and collaborators (Gulordava et al. 2018. Colorless Green Recurrent Networks Dream Hierarchically. In Proceedings of NAACL, 1195–1205), human control groups were established to detect asymmetries with respect to the results of the neural networks.
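The two ideas in this answer, word vectors built from co-occurrence counts and adjectives treated as matrices that act on noun vectors, can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the corpus, the window size, and the helper names (`cooccurrence_vectors`, `to_dense`, `apply_matrix`) are hypothetical, and the “red” matrix is set by hand purely for illustration, whereas in the cited work by Baroni and Zamparelli the adjective matrices are estimated from corpus data.

```python
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def to_dense(counter, vocabulary):
    """Turn a sparse count mapping into a list ordered by `vocabulary`."""
    return [counter.get(w, 0) for w in vocabulary]


def apply_matrix(matrix, vector):
    """Matrix-vector product: the adjective (matrix) acts on the noun (vector)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical toy corpus; real systems use millions of sentences.
toy_corpus = ["the red car stopped", "the red house burned", "the car stopped"]
vectors = cooccurrence_vectors(toy_corpus, window=1)
vocabulary = sorted(vectors)

# A noun is represented as a vector of co-occurrence counts.
car = to_dense(vectors["car"], vocabulary)

# An adjective is represented as a matrix. This one is hand-written for
# illustration only (identity with a boosted "red" dimension), not learned.
n = len(vocabulary)
red_matrix = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
red_idx = vocabulary.index("red")
red_matrix[red_idx][red_idx] = 3

# The phrase "red car" is the matrix applied to the noun vector.
red_car = apply_matrix(red_matrix, car)
print(vocabulary, car, red_car)
```

The point of the design is the type difference: the noun is a list of numbers, while the adjective is a function from noun vectors to phrase vectors, which is what distinguishes this approach from treating every word as the same kind of object.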
03.
Ping Wang: Corpus-based language research is usually carried out by analysing the frequency of an entity in a large amount of data, which is the quantitative method. In your opinion, how should we evaluate the advantages and disadvantages of quantitative methods in language research?
Dr. Giuseppe Samo: This is a longstanding debate in grammar, going back to the earliest discussions of Chomsky in the 1960s and his famous sentence “Colorless green ideas sleep furiously”. I personally follow the direction set by Prof. Merlo (Merlo P. 2016. Quantitative computational syntax: some initial results, Italian Journal of Computational Linguistics, 2016, 2(1), 11–30), who states that frequency should be considered the dependent variable in the study of grammar. I personally take frequency neither as a factor (in the spirit of usage-based grammars; see, for example, Ibbotson P. 2013. The Scope of Usage-Based Theory, Frontiers in Psychology, for an overview) nor as a totally irrelevant element. In a recent paper with Prof. Paola Merlo (Samo, G., Merlo, P., 2019, Intervention effects in object relatives in English and Italian: a study in quantitative computational syntax, Quasy, 46–56), we argued that frequencies depend on the complexity of configurations: if something is complex to compute, it will surface less often than a much easier construction. However, we still face the problem of counterfactual data if we work with corpora. In other words, quantitative methods enrich the qualitative dimension, and the two types of analysis are not in full contrast.
04.
Ping Wang: You once said that syntactic cartography is of great significance to the development of artificial intelligence. Conversely, will artificial intelligence promote the development of linguistics? Could you please give an example?
Dr. Giuseppe Samo: Bear in mind that artificial intelligence is mainly equivalent to statistical computation. This question is a very important topic, especially concerning how linguistic theory can understand what happens in a neural network, which nowadays still represents a black box. A recent article by Professors Tal Linzen and Marco Baroni (2020, Syntactic Structure from Deep Learning, in press at Annual Review of Linguistics, downloadable at: https://arxiv.org/abs/2004.10827) aims to review the existing literature and, among other things, discusses some methods for interpreting these results from a theoretical point of view. What we need to do is map the results of the machine onto our formal representations and investigate them. On the other hand, from a practical point of view, better data obtained by adding syntactic information via linguistic annotation can help machines improve in many linguistic tasks, from machine translation, especially in under-resourced languages, to information retrieval.
05.
Ping Wang: For a long time, linguistics and computer science have been regarded as unrelated majors. If students majoring in linguistics want to study artificial intelligence linguistics or computational linguistics, do you have any constructive suggestions for them?
Dr. Giuseppe Samo: Every science (and therefore every major) is conveyed by means of language. Linguistic knowledge can definitely help in any other field. On the other hand, the analytical perspective offered by other objects of investigation can only be an advantage for a linguist. From philosophy to mathematics, from literature to psychology, every advanced study of a subject is helpful. For example, being able to understand the mathematics behind certain elements can be of great help, as can being able to program, with the knowledge provided by training in computer science, in order to analyse or extract data at will.
06.
Ping Wang: At present, the study of artificial intelligence is becoming more and more popular. Given the global trend in artificial intelligence, what new challenges do you think it will bring to interdisciplinary research?
Dr. Giuseppe Samo: Interaction with other domains is at the basis of every innovation, as was the dialogue of linguistics with cognitive studies. Nowadays, a good part of the humanities is going digital: linguistics was one of the earliest sciences to take part in this transformation, and it can act as a leader for the digital humanities as a whole. Artificial intelligence gives linguists different perspectives for testing the learnability of the proposed models, and the results should be compared with classical behavioural studies and brain studies. I don’t know what the future of linguistics will be, but expanding our methodologies and technologies to keep up with the times is definitely required.
Copyright of this article belongs to “理论语言学五道口站”; please contact this platform before reposting.
Editors: 马晓彤, 闫玉萌, 陈金玉, 王平
Layout: 马晓彤, 闫玉萌, 高洁
Proofreading: 王丽媛, 李芳芳